A Coalescent Model of Recombination Hotspots

Authors

  • Carsten Wiuf
  • David Posada

Abstract

Recent experimental findings suggest that the assumption of a homogeneous recombination rate along the human genome is too naive. These findings point to block-structured recombination rates; certain regions (called hotspots) are more prone than other regions to recombination. In this report a coalescent model incorporating hotspot or block-structured recombination is developed and investigated analytically as well as by simulation. Our main results can be summarized as follows: (1) the expected number of recombination events is much lower in a model with pure hotspot recombination than in a model with pure homogeneous recombination, (2) hotspots give rise to large variation in recombination rates along the genome as well as in the number of historical recombination events, and (3) the size of a (nonrecombining) block in the hotspot model is likely to be grossly overestimated when estimated from SNP data. The results are discussed with reference to the current debate about block-structured recombination and, in addition, the results are compared to genome-wide variation in recombination rates. A number of new analytical results about the model are derived.

Genetics 164: 407–417 (May 2003)

Corresponding author: Variagenics, 60 Hampshire St., Cambridge, MA 02139. E-mail: [email protected]

The process of recombination in humans has been intensively debated in recent years. Various recent findings suggest that the standard model assuming a flat rate of recombination along a chromosome is too crude an approximation to the actual recombination process acting on the human genome and that the standard model does not adequately explain the findings. In Daly et al. (2001), Jeffreys et al. (2001), Johnson et al. (2001), and Gabriel et al. (2002) it is argued that recombination tends to happen more often in certain regions, so-called hotspots, of a chromosome than in other regions, giving rise to long islands of nonrecombining or virtually nonrecombining genetic material.

If the above reports are true, our understanding of the recombination process as an evolutionary force must be adjusted accordingly: modeling of recombination and interpretation of recombination patterns play an important role in the analysis of genetic data. In this report we develop a model, the coalescent with recombination hotspots, which can be used for simulation and analysis of genetic data. Simulation of genetic data is an important tool for investigating and testing hypotheses about how genetic data have been shaped and is a useful way of gaining intuition about and insight into the consequences of evolutionary processes.

The coalescent with recombination hotspots is an extension of Kingman's (1982) coalescent and of the coalescent with recombination in various forms, the coalescent with uniform recombination rate and multilocus coalescent models (Hudson 1983; Hudson and Kaplan 1985; see Griffiths 1981 for a two-locus model). The idea is that recombination breakpoints are not chosen randomly along the chromosomes but are concentrated in certain regions of the chromosomes. One way to model this is to choose centers of recombination activity (i.e., hotspots) according to some point process (e.g., a Poisson process) and let recombination events happen at a rate descending from the centers. In the following a model along these lines is developed. In the next section an informal description of the model is presented, followed by a mathematical treatment with comparisons to the standard model. A scheme for simulation of sequence samples and histories is described. Some familiarity with the coalescent with recombination is required.

This report is intended to be methodological, where issues of relevance to the analysis of data are addressed. The new model is compared to the coalescent model with uniform recombination rate through simulations of various summary statistics. Of special interest are the consequences of ignoring hotspot recombination and how hotspot recombination affects the genome-wide variation in recombination rates. Various issues relating to inference in the hotspot model are raised in the discussion.

A MODEL OF RECOMBINATION HOTSPOTS

Think of an entire chromosome as being represented by a line and the gene or region we are interested in as being represented by the interval $(0, 1)$, as illustrated in Figure 1. Throughout we use "gene" in a loose sense, letting it be short for an arbitrary but fixed region in the genome. Thus, a gene might be several thousand kilobases long. For a gene $L$ nucleotides long each nucleotide takes up $1/L$ of the interval. However, for mathematical convenience we think of the gene as a continuous stretch of points.

Figure 1. Chromosome and gene.

The model we outline below is fairly general. In the mathematical exposition we restrict the model to a simpler model that still has most of the flexibility of the general model. Choose hotspots according to some point process. Perhaps the simplest and most sensible process in this connection is a Poisson process with intensity $\lambda > 0$, so that on average there are $\lambda x$ hotspots in a chromosomal segment of length $x$, and in particular there are $\lambda$ hotspots on average in the gene $(0, 1)$. In this fashion hotspots are scattered throughout the chromosome and different genes will have different numbers of hotspots, but different copies of the same chromosome will have the exact same number and the exact same locations of hotspots. In Figure 2 hotspots are labeled $x_j$, $j = \pm 1, \pm 2, \ldots$. If we knew the exact locations of the hotspots, e.g., from experiments, these would not have to be modeled stochastically. In the absence of such knowledge the point process reflects our prior information or expectation of how hotspots are distributed throughout the chromosome.

Figure 2. Genes and hotspots. Each point $x_j$, $j = \pm 1, \pm 2, \ldots$, represents a hotspot. Those to the left of 0 are indexed by negative integers, those to the right by positive integers.

Recombination crossovers happen around a particular hotspot, $x_j$, with a rate, $c_j$, per generation, and when a crossover occurs the breakpoint is chosen according to a distribution, $g_j(x)$, around $x_j$. We choose to call $x_j$ a hotspot, though a more correct terminology might be a "center of recombination activity." Unless the distribution $g_j(x)$ is closely centered around $x_j$, few recombinations would be at $x_j$ precisely. We say that recombination happens around $x_j$ if the breakpoint is chosen from $g_j(x)$. The rates $c_j$ could be chosen from some distribution, e.g., a $\Gamma$ distribution, or be constant for all $j$, $c_j = c$. In the former case we talk about rate heterogeneity, and in the latter about rate homogeneity. Similarly, $g_j(x)$ might vary with $j$ or be independent of $j$, $g_j(x) = g(x)$. For example, $g_j(x)$ could be normal with variance $\sigma_j^2$, or $g_j(x)$ could be uniform on $(-\varepsilon_j, \varepsilon_j)$. The parameters $\sigma_j^2$ and $\varepsilon_j$ could be chosen from a set of values or from a distribution. Potentially, this results in a model with many parameters stemming from the point process, the distribution of $c_j$, and the specification of $g_j(x)$.

Whether a hotspot, $x_j$, is "hot" or "cold" (as used by, e.g., Rosenberg and Nordborg 2002) depends on the two dimensions, $c_j$ and $g_j(x)$: $c_j$ determines the absolute rate of recombination in the region near $x_j$, whereas $g_j(x)$ determines the relative rates of recombination for positions near $x_j$. The "hottest" hotspots are obtained with high $c_j$ and very peaked $g_j(x)$; the "coolest" are obtained with low $c_j$ and flat $g_j(x)$.

The example in Figure 3 has two hotspots, one at $x_1 = 0.2$ and the other at $x_2 = 0.65$, and $g_j(x)$, $j = 1, 2$, are normal with variances $\sigma_1^2 = 0.015$ and $\sigma_2^2 = 0.01$, respectively. Thus, most breakpoints occur near the hotspots but some might fall farther away. There is little chance that a breakpoint around $x_2$ falls to the left of $x_1$ and vice versa, but some chance that a breakpoint around $x_1$ falls outside the gene, in which case the recombination event does not affect the evolution of the gene. Also, there is positive probability that a hotspot located outside the gene at $x_{-1} = -0.15$ (not shown in Figure 3) gives rise to a breakpoint that is within the gene (indicated by the dotted line at left in Figure 3).

Figure 3. Example of gene with two hotspots at $x_1 = 0.2$ and $x_2 = 0.65$. In addition, there is a hotspot outside the gene at $x_{-1} = -0.15$. Given that a recombination happens around $x_j$, $j = -1, 1, 2$, the breakpoint is chosen according to a normal density $N(0, \sigma_j^2)$ with $\sigma_j^2 = 0.01$, $0.015$, and $0.01$, respectively. The three curves shown (one only partly) are all normal.

Since $g_j(x)$ is proportional to the probability by which recombination happens at distance $x$ from the hotspot $x_j$, the sum of the curves in Figure 3 represents the overall rate of recombination in a given point (Figure 4). If $g_j(x)$ is sufficiently narrow around hotspots, little overlap with other hotspots occurs, resulting in truly distinguishable peaks.

The following interpretation of the model is intuitive: a hotspot $x_j$ can be thought of as a specific site or segment that is required for recombination to take place; however, the breakpoint itself might not be at the hotspot or fully determined by the hotspot, but just
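As a rough check on the qualitative statements about the Figure 3 example, the relevant tail probabilities can be worked out directly. This is only a sketch: it assumes that the values 0.01, 0.015, and 0.01 are the variances of normal densities centered at the hotspots $x_{-1} = -0.15$, $x_1 = 0.2$, and $x_2 = 0.65$, and $\Phi$ (a symbol introduced here for illustration) denotes the standard normal distribution function.

```latex
\begin{align*}
P(\text{breakpoint around } x_1 \text{ falls outside } (0,1))
  &= \Phi\!\left(\frac{0 - 0.2}{\sqrt{0.015}}\right)
   + 1 - \Phi\!\left(\frac{1 - 0.2}{\sqrt{0.015}}\right)
   \approx \Phi(-1.63) \approx 0.05, \\
P(\text{breakpoint around } x_{-1} \text{ falls inside } (0,1))
  &= \Phi\!\left(\frac{1 + 0.15}{\sqrt{0.01}}\right)
   - \Phi\!\left(\frac{0 + 0.15}{\sqrt{0.01}}\right)
   = \Phi(11.5) - \Phi(1.5) \approx 0.067, \\
P(\text{breakpoint around } x_2 \text{ falls to the left of } x_1)
  &= \Phi\!\left(\frac{0.2 - 0.65}{\sqrt{0.01}}\right)
   = \Phi(-4.5) \approx 3.4 \times 10^{-6}.
\end{align*}
```

Under these assumptions, roughly one breakpoint in twenty around $x_1$ misses the gene, about one in fifteen from the outside hotspot lands within it, and a breakpoint around $x_2$ essentially never crosses $x_1$.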
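The informal description above can also be turned into a small simulation sketch. The code below is illustrative only and is not the authors' simulation scheme: the function names are hypothetical, the densities $g_j$ are assumed to be normal and centered at the hotspots, the hotspot responsible for a crossover is assumed to be chosen with probability proportional to $c_j$, and the overall rate at a point is taken to be the sum of the curves weighted by $c_j$ (with $c_j = 1$ for all $j$ in the example values).

```python
import math
import random


def sample_hotspots(lam, left, right, rng):
    """Draw hotspot locations on (left, right) from a Poisson process.

    Gaps between consecutive points of a Poisson process with intensity
    lam are exponential with rate lam, so we accumulate such gaps.
    """
    positions = []
    x = left + rng.expovariate(lam)
    while x < right:
        positions.append(x)
        x += rng.expovariate(lam)
    return positions


def normal_pdf(d, var):
    """Density of N(0, var) evaluated at distance d from the hotspot."""
    return math.exp(-d * d / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def overall_rate(x, hotspots, rates, variances):
    """Sum of the hotspot curves at position x, each weighted by c_j."""
    return sum(
        c * normal_pdf(x - xj, v)
        for xj, c, v in zip(hotspots, rates, variances)
    )


def sample_breakpoint(hotspots, rates, variances, rng):
    """Pick a hotspot with probability proportional to c_j (assumption),
    then draw the breakpoint from the normal density around it."""
    j = rng.choices(range(len(hotspots)), weights=rates)[0]
    return rng.gauss(hotspots[j], math.sqrt(variances[j]))


# Example values read from Figure 3 (interpreted as variances).
HOTSPOTS = [-0.15, 0.2, 0.65]
VARIANCES = [0.01, 0.015, 0.01]
RATES = [1.0, 1.0, 1.0]  # assumption: rate homogeneity, c_j = c = 1
```

Breakpoints that land outside $(0, 1)$ can simply be discarded, since such events do not affect the evolution of the gene; evaluating `overall_rate` on a grid of positions gives a rate profile of the kind described for Figure 4.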

Publication date: 2003